Introduction

This lab provides a fairly brief introduction to using INLA for Bayesian analysis in R. The first section will cover using INLA for non-spatial models (LMs, GLMS and mixed-effects models),. We’ll then go on to look at using INLA for spatial regression models. In addition to INLA, R has a large number of add-on packages for Bayesian analysis listed here. Some notable ones are:

  • coda: provides routines for post-processing results
  • mcmcplots: provides prettier visualizations of MCMC results
  • MCMCpack: provides mainly standard statistical models in a simpler format
  • rstan: interface to the Stan modeling language
  • brms: simplified interface to Stan

Before starting the lab, you will need to set up a new folder for your working directory. Go to your geog6000 folder now and create a new folder for today’s class called lab11.

You’ll need the following files:

  • The GapMinder dataset: gapminderData5.csv
  • A dataset from a study of Irish schoolchildren irished.csv
  • A dataset of UFO sightings for US States nuforc.zip

You will need to download these files from Canvas, and move them from your Downloads folder to the datafiles folder that you made previously.

Now start RStudio and change the working directory to lab11. As a reminder, you can do this by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder.

Installing INLA

The INLA package is not available through the usual CRAN repositories. Instead, you’ll need to install this manually. The following code will install the current stable version of INLA:

install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

Once this is installed, we can load the libraries that we’ll need for this lab

library(dplyr)
library(ggplot2)
library(lme4)
library(sf)
library(INLA)
library(spdep)
library(tmap)

Non-spatial models

Linear regression

We’ll start by using INLA to build a simple linear regression model of life expectancy (lifeExp) as a function of per capita GDP (gpdPercap) from the GapMinder data. Start by a) loading the data; b) centering the years on 1952; and c) log transforming GDP (to check why we do this last step, try making a histogram of the untransformed and transformed GDP values).

gap <- read.csv("../datafiles/gapminderData5.csv")
gap$year2 <- gap$year - 1952
gap$gdpPercap2 <- log(gap$gdpPercap)

Next, make an OLS regression model using lm() as a comparison. With the increasing complexity of Bayesian models, it is quite useful to make a traditional model as a verification that your results are correct. In the summary(), take note of the coefficients, and the residual standard error (this is an estimate of how much residual variance you have)

fit_lm <- lm(lifeExp ~ gdpPercap2, gap)
summary(fit_lm)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap2, data = gap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.1009     1.2277  -7.413 1.93e-13 ***
## gdpPercap2    8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

Now let’s set up our INLA model. For this simple model, it’s pretty straightforward if we just use the default priors:

inla_lm <- inla(lifeExp ~ gdpPercap2, 
                data = gap)

By default, the INLA model will not calculate any model diagnostics (to make it as efficient as possible), so we can add the following arguments to get a) WAIC and DIC scores and b) model estimates of life expectancy for each observation:

inla_lm <- inla(lifeExp ~ gdpPercap2, 
                data = gap,
                control.compute = list(dic = TRUE, waic = TRUE),
                control.predictor = list(compute = TRUE))

If you want to see some of the detail of INLA working, add verbose=TRUE to the inla() function. Now let’s look at the output

summary(inla_lm)
## 
## Call:
##    c("inla(formula = lifeExp ~ gdpPercap2, data = gap, control.compute = 
##    list(dic = TRUE, ", " waic = TRUE), control.predictor = list(compute = 
##    TRUE))" ) 
## Time used:
##     Pre = 4.79, Running = 0.549, Post = 0.53, Total = 5.87 
## Fixed effects:
##               mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept) -9.099 1.228    -11.510   -9.099     -6.691 -9.099   0
## gdpPercap2   8.405 0.149      8.113    8.405      8.697  8.405   0
## 
## Model hyperparameters:
##                                          mean    sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.017 0.001      0.016    0.017
##                                         0.975quant  mode
## Precision for the Gaussian observations      0.018 0.017
## 
## Expected number of effective parameters(stdev): 2.00(0.00)
## Number of equivalent replicates : 851.93 
## 
## Deviance Information Criterion (DIC) ...............: 11760.48
## Deviance Information Criterion (DIC, saturated) ....: 1707.78
## Effective number of parameters .....................: 3.03
## 
## Watanabe-Akaike information criterion (WAIC) ...: 11760.97
## Effective number of parameters .................: 3.51
## 
## Marginal log-Likelihood:  -5899.79 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

And there’s quite a lot. Note there is a table of fixed effects with the intercept and slope estimates, the WAIC and DIC, and the precision value. To show only the fixed effects,

inla_lm$summary.fixed
##                  mean        sd 0.025quant  0.5quant 0.975quant      mode
## (Intercept) -9.099393 1.2277173 -11.510106 -9.099428  -6.690708 -9.099395
## gdpPercap2   8.404902 0.1487672   8.112785  8.404897   8.696771  8.404902
##                      kld
## (Intercept) 1.500902e-09
## gdpPercap2  1.502466e-09

Note that as this is a Bayesian model, we don’t get a standard error and \(p\)-value, instead we get a description of the posterior distribution of each coefficient, so the summary gives the mean, median, mode, s.d. and the 2.5 and 97.5 percentiles (0.025quant and 0.975quant). As a quick rule of thumb, if these percentiles have the same sign, we can assume that there is a credible effect (note the difference in language here). You can get a quick plot of these posterior distributions with INLA’s plot() function. By default, this will try to plot everything from the model, so it is easier to only select the fixed effects for now:

plot(inla_lm, plot.fixed.effects = TRUE,
     plot.random.effects = FALSE,
     plot.hyperparameters = FALSE,
     plot.predictor = FALSE)

In the linear model, we got an estimate of the residual standard deviation of around 7.6. This is a measure of the size of the errors from the model. In the INLA output, we get an estimate of the residual precision instead, which is the inverse of the variance. You can extract this with:

inla_lm$summary.hyperpar
##                                               mean           sd 0.025quant
## Precision for the Gaussian observations 0.01724183 0.0005814291 0.01611705
##                                           0.5quant 0.975quant       mode
## Precision for the Gaussian observations 0.01723434 0.01840503 0.01721826

To compare, we can invert this and take the square root to convert back to standard deviation units, and to check that the models are estimating the same value:

sqrt(1 / 0.0172)
## [1] 7.624929

Which matches the estimate from the lm() function. And we can plot the posterior distribution:

plot(inla_lm, plot.fixed.effects = FALSE,
     plot.random.effects = FALSE,
     plot.hyperparameters = TRUE,
     plot.predictor = FALSE)

As INLA calculated the Bayesian posterior estimates for each observation, we can compare these to the observed life expectancy. The posterior estimates are in:

inla_lm$summary.linear.predictor

We’ll just plot the mean of these predictions against the observations:

plot(gap$lifeExp, inla_lm$summary.linear.predictor$mean)
abline(0,1)

Generalized models

In the next example, we’ll build a generalized linear model (GLM) using the Irish education dataset. In this dataset, we are interested in modeling the probability of a student taking the leaving certificate (lvcert) using a verbal reasoning score (DVRT). Load the data and center the DVRT scores:

irished <- read.csv("../datafiles/irished.csv")
irished$DVRT.cen <- irished$DVRT - mean(irished$DVRT)

Build a standard glm for comparison:

fit_glm <- glm(lvcert ~ DVRT.cen, 
               data = irished, 
               family = binomial(link = 'logit'))

summary(fit_glm)
## 
## Call:
## glm(formula = lvcert ~ DVRT.cen, family = binomial(link = "logit"), 
##     data = irished)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9077  -0.9810  -0.4543   1.0307   2.1552  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.278422   0.099665  -2.794  0.00521 ** 
## DVRT.cen     0.064369   0.007576   8.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 686.86  on 499  degrees of freedom
## Residual deviance: 593.77  on 498  degrees of freedom
## AIC: 597.77
## 
## Number of Fisher Scoring iterations: 3

Again, note the coefficient estimates. Now let’s build the same model with INLA. Here, we simply specify the family argument (the default is gaussian), and use the same arguments as before to estimate model diagnostics:

inla_glm <- inla(lvcert ~ DVRT.cen, 
                 data = irished, 
                 family = "binomial",
                 control.compute = list(dic = TRUE, waic = TRUE),
                 control.predictor = list(compute = TRUE))
summary(inla_glm)
## 
## Call:
##    c("inla(formula = lvcert ~ DVRT.cen, family = \"binomial\", data = 
##    irished, ", " control.compute = list(dic = TRUE, waic = TRUE), 
##    control.predictor = list(compute = TRUE))" ) 
## Time used:
##     Pre = 5.54, Running = 0.323, Post = 0.383, Total = 6.24 
## Fixed effects:
##               mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept) -0.279 0.100     -0.476   -0.279     -0.084 -0.278   0
## DVRT.cen     0.065 0.008      0.050    0.064      0.080  0.064   0
## 
## Expected number of effective parameters(stdev): 2.00(0.00)
## Number of equivalent replicates : 249.92 
## 
## Deviance Information Criterion (DIC) ...............: 597.75
## Deviance Information Criterion (DIC, saturated) ....: 597.75
## Effective number of parameters .....................: 1.99
## 
## Watanabe-Akaike information criterion (WAIC) ...: 597.79
## Effective number of parameters .................: 2.02
## 
## Marginal log-Likelihood:  -306.61 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

As before, we can plot out the posterior distributions of the intercept and slope. How do these compare to the glm output?

plot(inla_glm, plot.fixed.effects = TRUE,
     plot.random.effects = FALSE,
     plot.hyperparameters = FALSE,
     plot.predictor = FALSE)

Note that there is no separate estimate of residual variance in a binomial model.

Mixed-effects models

Next, we’ll estimate a mixed-effects model for the Gap Minder data, with a random intercept. First, we’ll build a comparison model using lmer() from the lme4 package. We’ll again model the trend of life expectancy over time within the dataset as a linear fixed effect:

fit_lmer <- lmer(lifeExp ~ year2 + (1 | country), gap)
summary(fit_lmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: lifeExp ~ year2 + (1 | country)
##    Data: gap
## 
## REML criterion at convergence: 9866.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.1692 -0.4910  0.1216  0.6030  2.4194 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  country  (Intercept) 123.15   11.097  
##  Residual              12.85    3.584  
## Number of obs: 1704, groups:  country, 142
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 50.51208    0.94547   53.43
## year2        0.32590    0.00503   64.79
## 
## Correlation of Fixed Effects:
##       (Intr)
## year2 -0.146

In INLA, additional random effects can be added with the f() function. These include a large number of effects, including groups (as here) and spatial and temporal effects (see below). The format is f(ID, model), where ID is an index in the data set identifying groups or other structures and model is the type of random effect. Here, we’ll use the iid model with assumes that model errors are i.i.d. within countries.

inla_lmer <- inla(lifeExp ~ year2 + f(country, model = "iid"), 
                  data = gap,
                 control.compute = list(dic = TRUE, waic = TRUE),
                 control.predictor = list(compute = TRUE))
summary(inla_lmer)
## 
## Call:
##    "inla(formula = lifeExp ~ year2 + f(country, model = \"iid\"), data = 
##    gap)" 
## Time used:
##     Pre = 6.04, Running = 0.558, Post = 0.343, Total = 6.94 
## Fixed effects:
##               mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept) 50.512 0.947     48.651   50.512     52.372 50.512   0
## year2        0.326 0.005      0.316    0.326      0.336  0.326   0
## 
## Random effects:
##   Name     Model
##     country IID model
## 
## Model hyperparameters:
##                                          mean    sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.078 0.003      0.072    0.078
## Precision for country                   0.008 0.001      0.006    0.008
##                                         0.975quant  mode
## Precision for the Gaussian observations      0.083 0.078
## Precision for country                        0.010 0.008
## 
## Expected number of effective parameters(stdev): 141.77(0.154)
## Number of equivalent replicates : 12.02 
## 
## Marginal log-Likelihood:  -4968.37

Check that the fixed effects are similar to those from lmer(). Again, you can plot these:

plot(inla_lmer, plot.fixed.effects = TRUE,
     plot.random.effects = FALSE,
     plot.hyperparameters = FALSE,
     plot.predictor = FALSE)

If we plot the hyperparameters, we now get the posterior distribution for the residual variance and the variance explained by the random intercepts.

plot(inla_lmer, plot.fixed.effects = FALSE,
     plot.random.effects = FALSE,
     plot.hyperparameters = TRUE,
     plot.predictor = FALSE, 
     plot.prior = TRUE)

We can compare these by extracting these values and back converting to variance. First, the lmer() effects

VarCorr(fit_lmer)
##  Groups   Name        Std.Dev.
##  country  (Intercept) 11.0972 
##  Residual              3.5842
inla_lmer$summary.hyperpar
##                                                mean          sd  0.025quant
## Precision for the Gaussian observations 0.078002830 0.002800661 0.072475583
## Precision for country                   0.008289739 0.001034976 0.006270206
##                                            0.5quant 0.975quant        mode
## Precision for the Gaussian observations 0.078025292 0.08347545 0.078173381
## Precision for country                   0.008302706 0.01030871 0.008409998
# Residual s.d.
sqrt(1 / 0.078)
## [1] 3.580574
# Intercept s.d.
sqrt(1 / 0.0082)
## [1] 11.04315

We can get the random effect for any country (the offset from the grand mean intercept). These are in inla_lmer$summary.random. Each row has the country name (this was the index used in the f() function) as well as the summary of the posterior distribution for that country. To find the random effect for Japan:

rowid <- which(inla_lmer$summary.random$country == "Japan")
inla_lmer$summary.random$country[rowid, ]
##       ID     mean       sd 0.025quant 0.5quant 0.975quant     mode          kld
## 67 Japan 15.21874 1.386968   12.49579 15.21834   17.94135 15.21765 1.318092e-08

Indicating that Japan’s life expectancy is about 15 years above the global average over the period Gap observations, and the 2.5 and 97.5 percentile indicate that this is credible higher than average. We can also plot the marginal distribution of this estimate (this is the distribution), by extracting the relevant values from the inla_lmer$marginals.random:

japan_re <- inla_lmer$marginals.random$country[[rowid]]
plot(japan_re, type = 'l', xlab = "LifeExp", ylab = "p")

Spatial models

We’ll now turn to building spatial models with INLA. In the nuforc.zip file there is a shapefile with the counts of UFO sightings for US states between 1990 and 2019. Let’s load this now:

ufos <- st_read("../datafiles/nuforc/ufo_sf_annual.shp")
## Reading layer `ufo_sf_annual' from data source 
##   `/Users/u0784726/Dropbox/Data/devtools/geog6000/datafiles/nuforc/ufo_sf_annual.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1421 features and 44 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.71 ymin: 24.54235 xmax: -66.98702 ymax: 49.36967
## Geodetic CRS:  WGS 84

This contains quite a lot of information, but our models, we’ll need the count data (count) and the population size. We’ll subset out only 2014 using dplyr (you can also do this with ufos_2014 <- subset(ufos, year == 2014)):

ufos_2014 <- ufos %>%
  filter(year == 2014)

And now use tmap to map out the counts and population size:

p1 <- tm_shape(ufos_2014) + tm_fill("count", style = "jenks") +
  tm_borders() +
  tm_layout(main.title = "UFO sightings 2014")

p2 <- tm_shape(ufos_2014) + tm_fill("population", style = "jenks") +
  tm_borders() +
  tm_layout(main.title = "State population 2014")

tmap_arrange(p1, p2)

We’ll start with a simple model of the count of sightings as a function of population size. We’ll use this as a demonstration, even though the outcome are counts, and a Poisson model would be a better approach (we’ll get to this shortly).

ufo_lm <- lm(count ~ population, ufos_2014)
summary(ufo_lm)
## 
## Call:
## lm(formula = count ~ population, data = ufos_2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -261.79  -30.40  -12.68   22.04  219.68 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.452e+01  1.480e+01   2.332    0.024 *  
## population  1.988e-03  1.543e-04  12.882   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.66 on 47 degrees of freedom
## Multiple R-squared:  0.7793, Adjusted R-squared:  0.7746 
## F-statistic: 165.9 on 1 and 47 DF,  p-value: < 2.2e-16

And now we’ll build the same using INLA:

ufo_inla_lm <- inla(count ~ population,
                    data = ufos_2014,
                    control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                    control.predictor = list(compute = TRUE))

Let’s check the coefficients against the OLS model:

ufo_inla_lm$summary.fixed
##                     mean           sd  0.025quant     0.5quant   0.975quant
## (Intercept) 34.524391174 1.485288e+01 5.251553181 34.523952272 63.756196425
## population   0.001987856 1.548172e-04 0.001682735  0.001987852  0.002292551
##                     mode          kld
## (Intercept) 34.524398804 2.327035e-06
## population   0.001987856 2.321022e-06

We’ll now plot the posterior distribution of the coefficients and the estimate of the residual variance. Rather than using INLA’s plot() function, we’ll get the values directly from the model output and plot them. The posterior distribution of the fixed effects can be obtained from:

ufo_inla_lm$marginals.fixed

And the residual variance from:

ufo_inla_lm$marginals.hyperpar
  • Plot posterior distribution of intercept
plot_df <- as.data.frame(inla.tmarginal(function(x) x, 
                                        ufo_inla_lm$marginals.fixed$`(Intercept)`))
ggplot(plot_df, aes(x = x, y = y)) +
  geom_line(size = 2) +
  scale_x_continuous("Intercept") +
  scale_y_continuous("P") +
  theme_bw()

  • Plot posterior distribution of population coefficient
plot_df <- as.data.frame(inla.tmarginal(function(x) x, 
                                        ufo_inla_lm$marginals.fixed$population))
ggplot(plot_df, aes(x = x, y = y)) +
  geom_line(size = 2) +
  scale_x_continuous("Population") +
  scale_y_continuous("P") +
  theme_bw()

  • Plot posterior distribution of residual variance
plot_df <- as.data.frame(inla.tmarginal(function(x) 1/x, 
                                        ufo_inla_lm$marginals.hyperpar$`Precision for the Gaussian observations`))
ggplot(plot_df, aes(x = x, y = y)) +
  geom_line(size = 2) +
  scale_x_continuous("Residual variance") +
  scale_y_continuous("P") +
  theme_bw()

Just as an example, the following code adds informed priors on the slope and intercept coefficient, by increasing the precision of the prior (effectively reducing the variance or range of values). The mean of the priors is left at zero. Try running this and compare the estimates of coefficients and the WAIC/DIC goodness of fit values.

ufo_inla_lm2 <- inla(count ~ population,
                     data = ufos_2014,
                     control.fixed = list(mean = 0, prec = 0.2, 
                                          mean.intercept = 0, prec.intercept = 0.2),
                     control.family=list(hyper=list(prec = list(prior = 'loggamma', 
                                                              param = c(0.1, 0.1)))),
                     control.compute = list(dic = TRUE, waic = TRUE),
                     control.predictor = list(compute = TRUE))
summary(ufo_inla_lm2)
## 
## Call:
##    c("inla(formula = count ~ population, data = ufos_2014, control.compute 
##    = list(dic = TRUE, ", " waic = TRUE), control.predictor = list(compute 
##    = TRUE), control.family = list(hyper = list(prec = list(prior = 
##    \"loggamma\", ", " param = c(0.1, 0.1)))), control.fixed = list(mean = 
##    0, prec = 0.2, ", " mean.intercept = 0, prec.intercept = 0.2))") 
## Time used:
##     Pre = 5.05, Running = 0.276, Post = 0.33, Total = 5.66 
## Fixed effects:
##              mean    sd 0.025quant 0.5quant 0.975quant  mode kld
## (Intercept) 0.698 2.218     -3.657    0.699      5.049 0.699   0
## population  0.002 0.000      0.002    0.002      0.002 0.002   0
## 
## Model hyperparameters:
##                                         mean   sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.00 0.00       0.00     0.00
##                                         0.975quant mode
## Precision for the Gaussian observations       0.00 0.00
## 
## Expected number of effective parameters(stdev): 1.02(0.004)
## Number of equivalent replicates : 48.02 
## 
## Deviance Information Criterion (DIC) ...............: 571.83
## Deviance Information Criterion (DIC, saturated) ....: 55.34
## Effective number of parameters .....................: 2.18
## 
## Watanabe-Akaike information criterion (WAIC) ...: 577.38
## Effective number of parameters .................: 6.47
## 
## Marginal log-Likelihood:  -298.00 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

Spatial neighborhood and weight matrix

We’ll use a conditional autoregressive (CAR) model to account for spatial dependency in the UFO data. This is a very similar approach to the spatial error model we looked at in a previous lab. In INLA, this means adding a new random effect to the model using the f() function. There’s a few things we need to do to set this up:

  • Create an index in the dataset (this will be linked to the spatial weight matrix)
ufos_2014$idarea <- 1:nrow(ufos_2014)
  • Create a neighborhood structure (here we use the poly2nb() function from spdep)
ufos_nb <- poly2nb(ufos_2014)
  • Convert the neighborhood structure to a spatial weight matrix
ufos_W <- nb2mat(ufos_nb)

Quick plot of the spatial adjacencies:

plot(st_geometry(ufos_2014), reset = FALSE)
plot(ufos_nb, st_coordinates(st_centroid(ufos_2014)), add = TRUE, col = 2, lwd = 2)

With all that in place, we can build a spatial model. We’ll use the newer bym2 formulation of the CAR model. Note that the format is similar to specifying a random effect above: we need the index (idarea), the model (bym2) and the spatial weight matrix:

ufo_inla_bym <- inla(count ~ population + f(idarea, model = "bym2", graph = ufos_W), 
                     data = ufos_2014,
                     control.compute = list(dic = TRUE, waic = TRUE),
                     control.predictor = list(compute = TRUE)
)

Now let’s look at the output:

summary(ufo_inla_bym)
## 
## Call:
##    c("inla(formula = count ~ population + f(idarea, model = \"bym2\", ", " 
##    graph = ufos_W), data = ufos_2014, control.compute = list(dic = TRUE, 
##    ", " waic = TRUE), control.predictor = list(compute = TRUE))" ) 
## Time used:
##     Pre = 24.7, Running = 0.906, Post = 0.378, Total = 25.9 
## Fixed effects:
##               mean     sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept) 34.525 14.753      5.487   34.524     63.531 34.525   0
## population   0.002  0.000      0.002    0.002      0.002  0.002   0
## 
## Random effects:
##   Name     Model
##     idarea BYM2 model
## 
## Model hyperparameters:
##                                             mean       sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.00e+00 0.00e+00      0.000    0.000
## Precision for idarea                    9.13e+05 3.53e+06      1.131 1079.850
## Phi for idarea                          3.26e-01 1.03e-01      0.151    0.317
##                                         0.975quant  mode
## Precision for the Gaussian observations   0.00e+00 0.000
## Precision for idarea                      1.17e+07 0.602
## Phi for idarea                            5.48e-01 0.296
## 
## Expected number of effective parameters(stdev): 2.00(0.002)
## Number of equivalent replicates : 24.48 
## 
## Deviance Information Criterion (DIC) ...............: 567.49
## Deviance Information Criterion (DIC, saturated) ....: 52.88
## Effective number of parameters .....................: 2.60
## 
## Watanabe-Akaike information criterion (WAIC) ...: 572.02
## Effective number of parameters .................: 6.08
## 
## Marginal log-Likelihood:  -281.16 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

The main change to the output is that we now have three hyperparameter values:

  • Precision for the Gaussian observations: this is the precision on the non-spatial residuals
  • Precision for idarea: this is the precision on the spatial field
  • Phi for idarea: this is the mixing parameter, which decomposes the precision for idarea into a spatial and non-spatial effect.

Note that the precision for the Gaussian term is very high (the variance is low). This is likely not very well constrained, so we’ll try to add a prior on the spatial random effect to improve this. Here we specify a penalized complexity prior - note that we need one for the precision term and one for phi. The format for these priors is given by \((U, \alpha)\), where \(\alpha\) is the probability of the hyperparameter \(\xi\) exceeding \(U\): (\(P(T(\xi) > U) = \alpha\))

prior_bym2 <- list(
  prec = list(
    prior = "pc.prec",
    param = c(1, 0.01)),
  phi = list(
    prior = "pc",
    param = c(0.5, 2 / 3))
)

And let’s rebuild the model:

ufo_inla_bym2 <- inla(count ~ population + f(idarea, model = "bym2", 
                                            graph = ufos_W,
                                            hyper = prior_bym2), 
                     data = ufos_2014,
                     control.compute = list(dic = TRUE, waic = TRUE),
                     control.predictor = list(compute = TRUE),
                     control.inla = list(strategy = "adaptive",int.strategy = "eb")
)
ufo_inla_bym2$summary.hyperpar
##                                                 mean           sd   0.025quant
## Precision for the Gaussian observations 1.784842e-04 3.889182e-05 1.133510e-04
## Precision for idarea                    1.127060e+04 9.992121e+03 3.562608e+03
## Phi for idarea                          6.477860e-01 9.705357e-02 4.676081e-01
##                                             0.5quant   0.975quant         mode
## Precision for the Gaussian observations 1.747749e-04 2.651107e-04 1.677363e-04
## Precision for idarea                    8.170725e+03 3.740793e+04 5.017224e+03
## Phi for idarea                          6.444866e-01 8.371989e-01 6.214203e-01

These look a little better (or at least more consistent). From the results we can map out the spatial effect. This is the random effect in space as modeled by the bym2 model. Summary values are held in ufo_inla_bym2$summary.random$idarea. Note that only the first 49 values here represent the spatial random field:

ufos_2014$u <- ufo_inla_bym2$summary.random$idarea$mean[1:49]
tm_shape(ufos_2014) + tm_fill("u", style = "jenks") +
  tm_borders() +
  tm_layout(main.title = "UFO model spatial random effect")

We can also extract the Bayesian estimates of the number of sightings from summary.linear.predictor:

ufos_2014$yhat <- ufo_inla_bym2$summary.linear.predictor$mean
tm_shape(ufos_2014) + tm_fill("yhat", style = "jenks") +
  tm_borders() +
  tm_layout(main.title = "State population 2014")

Rate model

We noted above that as our outcome variable is the count of sightings, a Poisson model would be a better choice. We can convert out model to use a Poisson distribution by adding the argument family = "poisson" (similar the the binomial model we looked at earlier). However, for these data, we may want to model them as rates, rather than as counts. A reasonable expectation for the count of something is that it will depend on the size of the underlying population. In this example, the number of sightings is strongly correlated with the population size of the each state (this is basically what we modeled in the previous section). Modeling rates means that we can investigate if the rate of sightings on a per capita basis is relatively higher or lower than the expected number of counts, if all states were equal. This gives us a model of relative rates. In a classic Poisson model, we model the log of counts (\(Y_i\)) as a function of covariates:

\[ log(Y_i) \sim \beta X_i + \epsilon \]

In a relative rate model, the observed count (\(Y_i\)) for location \(i\) is defined as the expected count for \(i\) multiplied by the relative rate (\(\theta_i\)):

\[ Y_i \sim Pois(E_i \theta_i) \]

And then we model the log of the rate term:

\[ log(\theta_i) = \beta X_i + \epsilon \]

To model this with INLA, we first need to estimate the expected value (\(E_i\)), the number of sightings in each state if all states have the same rate. We’ll do this in a two step process. First, calculate the rate (\(r_s\)) in the standard population. This is given by the total number of sightings in the data, divided by the total population:

rs <- sum(ufos_2014$count) / sum(ufos_2014$population)
rs
## [1] 0.002522805

This gives the per capita rate of UFO sightings for the US (about 1 in 400). Now to get \(E_i\) for each state, multiply the state population by this number:

ufos_2014$Ei <- ufos_2014$population * rs

We’ll also redefine the priors for the spatial model:

prior_bym2 <- list(
  prec = list(
    prior = "pc.prec",
    param = c(0.6 / 0.31, 0.01)),
  phi = list(
    prior = "pc",
    param = c(0.005, 1 / 10))
)

With this information, we can now build our model. We’ll start by simply modeling the counts with an intercept term. We specify the family as poisson and add the argument E to define the expected counts. We again include the bym2 model for the spatial effects. The final argument (control.inla) speeds up the calculation of the hyperparameters at the cost of some accuracy. You can ignore this.

ufo_inla_rr <- inla(count ~ 1 + f(idarea, model = "bym2", 
                           graph = ufos_W,
                           hyper = prior_bym2), 
             data = ufos_2014, family = "poisson", E = Ei, 
             control.compute = list(dic = TRUE, waic = TRUE),
             control.predictor = list(compute = TRUE),
             control.inla = list(strategy = "adaptive",int.strategy = "eb")
)
summary(ufo_inla_rr)
## 
## Call:
##    c("inla(formula = count ~ 1 + f(idarea, model = \"bym2\", graph = 
##    ufos_W, ", " hyper = prior_bym2), family = \"poisson\", data = 
##    ufos_2014, ", " E = Ei, control.compute = list(dic = TRUE, waic = 
##    TRUE), ", " control.predictor = list(compute = TRUE), control.inla = 
##    list(strategy = \"adaptive\", ", " int.strategy = \"eb\"))") 
## Time used:
##     Pre = 23.4, Running = 0.353, Post = 0.365, Total = 24.1 
## Fixed effects:
##              mean    sd 0.025quant 0.5quant 0.975quant  mode kld
## (Intercept) 0.033 0.047      -0.06    0.033      0.126 0.033   0
## 
## Random effects:
##   Name     Model
##     idarea BYM2 model
## 
## Model hyperparameters:
##                       mean   sd 0.025quant 0.5quant 0.975quant  mode
## Precision for idarea 4.759 1.30      2.660    4.610      7.738 4.328
## Phi for idarea       0.568 0.24      0.115    0.584      0.949 0.771
## 
## Expected number of effective parameters(stdev): 45.04(0.00)
## Number of equivalent replicates : 1.09 
## 
## Deviance Information Criterion (DIC) ...............: 420.40
## Deviance Information Criterion (DIC, saturated) ....: 106.46
## Effective number of parameters .....................: 45.18
## 
## Watanabe-Akaike information criterion (WAIC) ...: 414.14
## Effective number of parameters .................: 28.14
## 
## Marginal log-Likelihood:  -236.32 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

As before, we can extract and plot the spatial random field

ufos_2014$u <- ufo_inla_rr$summary.random$idarea$mean[1:49]
tm_shape(ufos_2014) + tm_fill("u", style = "jenks") +
  tm_borders() +
  tm_layout(main.title = "UFO model spatial random effect (RR model)")

In addition, we can plot the estimated relative rates for each state from the summary of the fitted values:

ufos_2014$RR <- ufo_inla_rr$summary.fitted.values[, "mean"]
tm_shape(ufos_2014) + tm_fill("RR", palette = "-inferno") +
  tm_borders() +
  tm_layout(main.title = "UFO sightings (relative rates)")

Lastly, we can estimate excedence probabilities. This is the probability that a given location has a relative rate that exceeds a given threshold. To estimate this, we need the marginal posterior distribution of relative rates for each state. To illustrate how this works, we’ll first extract the posterior distribution for Oregon.

rowid <- which(ufos_2014$name == "Oregon")
mpd <- ufo_inla_rr$marginals.fitted.values[[rowid]]
plot(mpd, type = 'l', xlab = "RR", ylab = "p")
abline(v = 2, lty = 2)

To get the probability of exceeding a relative rate of 2, we need to calculate the integral to the right hand side of the vertical line in the figure above. INLA comes with a function (inla.pmarginal()) that will do this for us. As this calculates the integral to the left hand side, we subtract the result from 1 to get the excedence probability

1 - inla.pmarginal(q = 2, marginal = mpd)
## [1] 0.9746807

Giving a 0.97 probability that sightings in Oregon exceed 2x the national average. Note that this function can also be used to estimate the credibility of coefficients being higher or lower than some threshold. We’ll now use R’s sapply() function to calculate this for all states, and map the results:

ufos_2014$exc <- sapply(ufo_inla_rr$marginals.fitted.values,
                        function(marg){1 - inla.pmarginal(q = 2, marginal = marg)})

tm_shape(ufos_2014) + tm_fill("exc", palette = "Blues") +
  tm_borders() +
  tm_layout(main.title = "Excedence probability (RR > 2)")

For a final model, we’ll add a covariate to our relative rate model, to see if there is a credible link to the rates of sightings. Here, we’ll load a data set with the number of airports per state, to test if the number of sightings is linked to this (and by extension the number of flights). Load this and log transform the number of airports.

airports <- read.csv("../datafiles/nuforc/airports.csv")
airports$lairports <- log(airports$airports)

Now, we’ll merge this with the 2014 UFO dataframe:

ufos_2014 <- merge( ufos_2014, airports, by.x = "postal", by.y = "state")
ggplot(ufos_2014, aes(x = airports, y = count)) +
  scale_x_log10("Airports") + scale_y_log10("Sightings") +
  geom_point()

ufo_inla_rr2 <- inla(count ~ lairports + 
               f(idarea, model = "bym2", 
                 graph = ufos_W,
                 hyper = prior_bym2), 
             data = ufos_2014, family = "poisson", E = Ei, 
             control.compute = list(dic = TRUE, waic = TRUE),
             control.predictor = list(compute = TRUE),
             control.inla = list(strategy = "adaptive",int.strategy = "eb")
)
summary(ufo_inla_rr2)
## 
## Call:
##    c("inla(formula = count ~ lairports + f(idarea, model = \"bym2\", ", " 
##    graph = ufos_W, hyper = prior_bym2), family = \"poisson\", ", " data = 
##    ufos_2014, E = Ei, control.compute = list(dic = TRUE, ", " waic = 
##    TRUE), control.predictor = list(compute = TRUE), ", " control.inla = 
##    list(strategy = \"adaptive\", int.strategy = \"eb\"))" ) 
## Time used:
##     Pre = 24.1, Running = 0.353, Post = 0.361, Total = 24.8 
## Fixed effects:
##               mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## (Intercept)  0.419 0.482     -0.529    0.420      1.364  0.421   0
## lairports   -0.068 0.084     -0.233   -0.068      0.097 -0.068   0
## 
## Random effects:
##   Name     Model
##     idarea BYM2 model
## 
## Model hyperparameters:
##                       mean    sd 0.025quant 0.5quant 0.975quant mode
## Precision for idarea 4.915 1.212      2.915    4.792      7.638 4.56
## Phi for idarea       0.142 0.144      0.005    0.093      0.542 0.01
## 
## Expected number of effective parameters(stdev): 45.72(0.00)
## Number of equivalent replicates : 1.07 
## 
## Deviance Information Criterion (DIC) ...............: 420.61
## Deviance Information Criterion (DIC, saturated) ....: 106.67
## Effective number of parameters .....................: 45.87
## 
## Watanabe-Akaike information criterion (WAIC) ...: 414.30
## Effective number of parameters .................: 28.51
## 
## Marginal log-Likelihood:  -242.86 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

And we’ll finish by plotting the posterior distribution of the coefficient for the log number of airports:

plot_df <- as.data.frame(ufo_inla_rr2$marginals.fixed$lairports)

ggplot(plot_df, aes(x = x, y = y))  +
  geom_line(size = 2) +
  scale_x_continuous("log(Airports)") +
  scale_y_continuous("p") +
  theme_bw()

# Files used in lab

Irish Education data set: irished.csv

Column header Variable
sex Sex of student (male = 0; female = 1)
DVRT Vocal reasoning test score
fathocc Prestige score of fathers occupation
lvcert Taken leaving certificate (yes = 1; no = 0)
schltype School type

GapMinder Dataset: gapminderData5.csv

Column header Variable
country Country
year Year
pop Population
continent Continent/Region
lifeExp Life expectancy (yrs)
gdpPercap Per Capita GDP (USD 2007)